Objectives: The Korean government has enacted two laws, namely, the Personal Information Protection Act and the Bioethics
and Safety Act to prevent the unauthorized use of medical information. To protect patients' privacy by complying with 
governmental regulations and improve the convenience of research, Asan Medical Center has been developing a de-identifi- 
cation system for biomedical research. Methods: We reviewed Korean regulations to define the scope of the de-identification 
methods and well-known previous biomedical research platforms to extract the functionalities of the systems. Based on these 
review results, we implemented necessary programs based on the Asan Medical Center Information System framework which 
was built using the Microsoft .NET Framework and C#. Results: The developed de-identification system comprises three 
main components: a de-identification tool, a search tool, and a chart review tool. The de-identification tool can substitute a 
randomly assigned research ID for a hospital patient ID, remove the identifiers in the structured format, and mask them in 
the unstructured format, i.e., texts. This tool achieved 98.14% precision and 97.39% recall for 6,502 clinical notes. The search 
tool can find the number of patients who satisfy given search criteria. The chart review tool can provide de-identified 
patient clinical data for review purposes. Conclusions: We found that a clinical data warehouse was essential for successful 
implementation of the de-identification system, and this system should be tightly linked to an electronic Institutional Review 
Board system for easy operation of honest brokers. Additionally, we found that a secure cloud environment could be adopted 
to protect patients' privacy more thoroughly. 
I. Introduction 

As Electronic Medical Record (EMR) systems have been 
widely adopted, research using EMR data has become widespread 
because large amounts of clinical data are easy to access. 
Consequently, concerns regarding the privacy and security 
of patients' medical records have been highlighted. In the 
United States, the Health Insurance Portability and Account- 
ability Act (HIPAA) defined guidelines for the secondary 
use of medical records. Based on the HIPAA, the Office 
for Civil Rights published a guideline for de-identification 
of medical records recently [1]. The Korean government also 
enacted two laws, namely, the Personal Information Protection 
Act [2] and the Bioethics and Safety Act [3], to prevent 
the unauthorized use of medical information. In particular, 
the revised Bioethics and Safety Act extends its scope from 
embryo research and genetic research to general biotechnol- 
ogy research to prevent any infringement of human dignity. 
Therefore, the revised act applies to both prospective studies 
and retrospective studies. However, it is almost impossible to 
obtain each participant's informed consent for most cases of 
retrospective studies. Therefore, Korean governmental regu- 
lations also suggest the de-identification of personal health 
information as an alternative. 

To protect patients' privacy by complying with the gov- 
ernmental regulations and improve the convenience of re- 
search, Asan Medical Center (AMC) has been developing a 
biomedical research platform. Based on a thorough review of 
the two Korean regulations and well-known previous biomedical 
research platforms that support de-identification, 
we implemented the prototype of a de-identification system. 
The de-identification system proposed in this paper includes 
not only identifier removal methods but also the necessary 
platform, such as user client programs and the interfaces for 
other programs. Since AMC is the biggest hospital in Korea 
and has actively adopted health information technology to 
improve the quality of care and to make clinical workflows 
more efficient [4], our experience and the lessons learned from 
developing the de-identification system will be helpful to 
other hospitals. 

II. Methods 

We first reviewed the Personal Information Protection Act 
and the Bioethics and Safety Act to define the scope of the 
de-identification methods. Then, we also investigated representative 
previous biomedical research platforms, such as the 
Stanford Translational Research Integrated Database Environment 
(STRIDE) [5], Informatics for Integrating Biology 
and the Bedside (i2b2) [6], and the Research Patient Data 
Registry (RPDR) [7], since de-identification will serve as a 
part of the research platform. 

1. Review of Regulations 

The scope of de-identification for biomedical research in 
Korea can be categorized into three parts, as shown in Table 1. 
First, we have to encrypt sensitive data such as the Korean 
resident registration number, which is similar to the Social 
Security number in the United States, when we store those 
data in a database system. Since the Korean resident regis- 
tration number is a unique life-long personal identifier and 
is used as a unique key to distinguish a specific person, this 
number must be encrypted securely. Also, the Institutional 
Review Board (IRB) process should be tightly linked to the 
de-identification process. Second, protected health informa- 
tion (PHI) must be removed if researchers do not have the 
proper informed consent. We have to de-identify not only 
all direct identifiers but also all possible quasi-identifiers. 
Quasi-identifiers are variables in a dataset whose values are 
not unique individually but can identify a specific person when 
combined. Last, we should establish a bio-bank to manage 
human materials. In this paper, we will focus on the de- 
identification of medical information. 

2. De-identification System Design 

STRIDE is a standard-based informatics platform supporting 
clinical and translational research [5,8]. It comprises three 
main databases, namely, a clinical data warehouse (CDW), a 
bio-specimen database, and research databases. On top of those 
databases, STRIDE provides an anonymous cohort identification tool, 
a patient cohort data review tool, clinical data extraction, research 
data management, and bio-specimen data management. 
STRIDE follows the HIPAA rules for de-identification. 
i2b2 integrates data from multiple sources, combines 
research data with clinical data, and focuses on cohorts and 
patient populations without PHIs [6,9]. RPDR is a centralized 
clinical data registry. Researchers access this data using 
the RPDR online query tool with user-defined queries to extract 
the aggregate numbers of patients and, with proper IRB 
approval, obtain detailed clinical data. RPDR secures patient 
information by controlling and auditing the distribution of 
patient data within the guidelines of the IRB and with the 
use of several built-in, automated security measures [7,10]. 

Table 1. Scope of de-identification in Korean regulations 

Object               Scope                   Remark
Basic                Encryption              Encrypting sensitive data
                     IRB process connection  Tight binding with IRB process
Medical information  De-identification       Removing all direct identifiers
                                             Removing all quasi-identifiers
Human material       Bio-bank                Establishing human material (blood, tissue) management system
                                             De-identifying all related information

IRB: Institutional Review Board. 

Figure 1. Diagram of Asan Medical Center de-identification system. EMR: Electronic Medical Record, CPOE: computerized physician order entry, LIS: laboratory information system, CDW: clinical data warehouse, IRB: Institutional Review Board, eCRF: electronic case report form, DB: database. 

Based on the defined scopes and the review of other systems, 
we designed an overall de-identification system and related 
processes, as shown in Figure 1. Physicians, nurses, or other 
medical staff members generate the clinical data using the 
hospital information system, including EMR, computerized 
physician order entry (CPOE), or the laboratory information 
management system. In this step, all data should be identifiable 
for proper patient care. The de-identification tool 
removes the structured identifiers, such as names, telephone 
numbers, and patient IDs. It also masks the identifiers in 
the unstructured data, such as names in the text, using regular 
expressions. Only honest brokers, which may be humans or a 
system, can reverse this process. The de-identified data and 
identifiable data are stored in the CDW and reorganized for 
easy search and analysis. Also, research data from clinical 
trials (electronic case report forms), disease registries, or 
the researcher's own data stored in Microsoft Access 
or Excel format can be transferred into the CDW. If necessary, 
public biomedical databases may be linked to the CDW. Users 
can access this system using two clients, namely, the search 
tool and the chart review tool. The search tool should have 
a user-friendly interface and support ad-hoc queries, since 
researchers want to check the size of the possible research 
cohort that satisfies the necessary conditions. Users can 
review the de-identified clinical data using the chart review 
tool. The de-identified clinical data should include all possible 
medical records related to each patient, including disease 
names or codes, operation names or codes, laboratory results, 
medication, and progress notes. For stricter protection, 
user privileges should be checked and maintained regularly by the IRB 
or another internal privacy board. If users need identifiable 
data or need to extract data, they should contact honest brokers 
with the proper informed consents and IRB approval. 

Figure 2. Overall system architecture of the Asan Medical Center (AMC) biomedical research platform. ICD: International Classification of Diseases. 
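
The removal of structured identifiers and the substitution of a research ID described above can be illustrated with a minimal Python sketch. It is not the AMC implementation (which was built on the .NET-based AMIS framework); the field names, the sample record, and the `deidentify_record` helper are hypothetical and chosen only for illustration.

```python
# Hypothetical field names; the actual AMC schema is not described in this paper.
DIRECT_IDENTIFIER_FIELDS = {"patient_name", "phone_number", "address_detail", "email"}


def deidentify_record(record, research_id_by_patient_id):
    """Return a copy of `record` without direct identifiers.

    The patient ID is replaced by the research ID found in the mapping table,
    which in the design above would be accessible only to honest brokers.
    """
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIER_FIELDS
    }
    patient_id = cleaned.pop("patient_id")
    cleaned["research_id"] = research_id_by_patient_id[patient_id]
    return cleaned


if __name__ == "__main__":
    record = {
        "patient_id": "12345678",
        "patient_name": "Hong Gildong",
        "phone_number": "010-1234-5678",
        "diagnosis_code": "I10",
    }
    print(deidentify_record(record, {"12345678": "87654321"}))
```

Because the function builds a new dictionary, the identifiable source record is left untouched, which matches the design in which identifiable and de-identified data coexist in the CDW.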

3. System Implementation 

The overall system architecture stack is shown in Figure 2. The 
de-identification system is based on diverse layers. The net- 
work layer, hardware layer, and database are located at the bot- 
tom of the stack. The terminology layer is located above them. 
AMC uses local codes for diagnosis, procedure, medication, 
operation, and laboratory tests, and AMC maps the necessary 
codes to the standard terminology, such as the International 
Classification of Diseases (ICD)-10 and ICD-9-CM. The data 
are stored in the CDW, the bio-bank database, or the research database. 
We implemented the necessary programs 
based on the Asan Medical Center Information System (AMIS) 
framework. The AMIS framework library was implemented 
using the Microsoft .NET Framework and C#. We developed 
all clients based on the AMC enterprise data warehouse 
(EDW) which has been used for both clinical research and 
hospital management since 2001. However, we are developing 
a new dedicated clinical research data warehouse to support 
researchers more efficiently. 

To focus on the development of the de-identification system, we 
introduce only three tools in this paper: the de-identification 
tool, the chart review tool, and the search tool. They are indicated 
in gray in Figure 2. 

III. Results 

1. Research ID Generation 

To reinforce the protection of privacy, we replace patient IDs 
with research IDs when those patient IDs are requested for 
the purpose of data review. We use a random number generator 
to create the candidate research ID for each patient 
and then check for duplication, as shown in Figure 3. Thus, 
we can assign a unique, randomly generated ID to each 
patient. Since we give this research ID to users when 
they are reviewing de-identified data, we decided not to use 
a hash function: a hash code is too long and complicated 
for this purpose. Also, the random number generation function 
returns 8-digit random numbers to maintain a format 
similar to that of the patient ID. Only honest brokers can access 
the mapping table of patient IDs and research IDs. 
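
The randomize-and-check loop described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the production code: the paper does not name the random number generator, so the use of Python's `secrets` module, the function names, and the in-memory `used_ids` set (standing in for the mapping table) are all assumptions.

```python
import secrets


def generate_research_id(used_ids, digits=8):
    """Draw random fixed-width numeric IDs until an unused one is found."""
    while True:
        # Zero-padded so every ID has the same width as the assumed patient ID format.
        candidate = f"{secrets.randbelow(10 ** digits):0{digits}d}"
        if candidate not in used_ids:  # duplication check
            used_ids.add(candidate)
            return candidate


def assign_research_ids(patient_ids):
    """Build the patient ID -> research ID mapping table."""
    mapping = {}
    used_ids = set()
    for patient_id in patient_ids:
        if patient_id not in mapping:  # one research ID per patient
            mapping[patient_id] = generate_research_id(used_ids)
    return mapping


if __name__ == "__main__":
    print(assign_research_ids(["10000001", "10000002", "10000001"]))
```

Unlike a hash of the patient ID, a random ID carries no information about the original value, so the mapping table is the only route back to the patient.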

2. Direct Identifier Removal 

We tried to remove the 20 PHIs defined by the AMC Privacy 
& Security Board, as shown in Table 2. Since Korea has no detailed 
governmental definition of PHIs comparable to the HIPAA 
PHI definition, we defined institutional PHIs. The major 
difference between the PHIs of AMC and those of HIPAA is 
that AMC did not define dates directly related to an individual 
as PHI. Researchers in AMC strongly insist that 
dates are necessary for clinical research. 

Figure 3. Flowchart for generating Research ID. The patient ID is randomized into a candidate ID, which is checked for duplication; if it is a duplicate (Yes), randomization is repeated, and if not (No), the candidate becomes the research ID. 

Table 2. Twenty protected health information (PHI) items defined by Asan Medical Center 

No  PHI                                                      Remark
1   Patient names                                            Excluding physicians' names
2   Address details                                          Smaller than -dong, -eup, and -myun
3   Phone numbers                                            Including mobile phone numbers and fax numbers
4   Email addresses
5   Korean resident registration numbers
6   Foreigner registration numbers
7   Passport numbers
8   Health insurance numbers
9   Bank account numbers
10  Credit card numbers
11  Certificate/license numbers
12  Vehicle license plate numbers
13  Patient IDs
14  Hospital membership IDs                                  Homepage, referral system
15  Hospital employee numbers
16  IP addresses
17  URLs
18  Biometric identifiers                                    Fingerprint, retinal, vein, voice prints, and other personally identifiable genetic information
19  Full face photographic images and any comparable images
20  Any other unique identifying numbers                     Pathology numbers

The identifiers in the structured format were easily re- 
moved. However, it is complicated to remove the identifiers 
in the unstructured format such as free texts. There have 
been several previous works on the automatic de-identifi- 
cation of textual data in EMR [11,12], and some tools have 
shown reliable performance [13,14]. However, physicians in 
AMC wrote free texts using Korean as well as English. There- 
fore, it was hard to apply English-oriented de-identification 
tools. To overcome this problem, we applied a heuristic 
approach using regular expressions as a first step [15]. We 
masked the 20 PHIs in free texts as shown in Figure 4. The 
developed method was verified on 6,502 carefully chosen 
clinical notes of 66 types, including inpatient, outpatient, 
emergency room, and operating room notes. Those clinical 
notes were written by 498 different physicians. Five human 
annotators reviewed all of the notes manually to confirm 
the performance of the automatic method. The de-identification 
tool achieved 98.14% precision and 97.39% recall. Here, 
precision means the ratio of correctly masked PHIs 
to all masked data, and recall means the ratio of 
successfully masked PHIs to all PHIs. There were 
1,861 PHIs in the 6,502 clinical notes. Among them, 1,837 
PHIs were accurately masked, 18 non-PHIs were removed, 
and 24 PHIs still remained. When reviewing, we could find 
only 4 types of PHIs, which were phone numbers, patient names, 
addresses, and patient IDs. Other PHIs were not found in free 
texts. 
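
The heuristic masking of PHIs in free text with regular expressions can be sketched in Python as below. The patterns are assumptions made for illustration (a 6-7 digit resident registration number format, hyphenated Korean-style phone numbers, and 8-digit patient IDs); the actual expressions used at AMC are not given in this paper, and the `mask_phi` function and mask labels are hypothetical.

```python
import re

# Illustrative patterns only. More specific patterns are applied first so that
# a later, looser pattern cannot partially match an already-masked identifier.
PATTERNS = [
    ("[RRN]", re.compile(r"\b\d{6}-\d{7}\b")),             # resident registration number
    ("[PHONE]", re.compile(r"\b0\d{1,2}-\d{3,4}-\d{4}\b")),  # e.g. 010-1234-5678, 02-123-4567
    ("[PATIENT_ID]", re.compile(r"\b\d{8}\b")),            # assumed 8-digit patient ID
]


def mask_phi(text, patient_names=()):
    """Mask known patient names and pattern-based identifiers in free text."""
    for name in patient_names:
        # Names cannot be found by format alone, so they are supplied explicitly
        # (for example, from the structured record of the same patient).
        text = re.sub(re.escape(name), "[NAME]", text)
    for label, pattern in PATTERNS:
        text = pattern.sub(label, text)
    return text


if __name__ == "__main__":
    note = "Pt ID 12345678, call 010-1234-5678. RRN 800101-1234567."
    print(mask_phi(note))
```

A rule-based approach of this kind works the same way on Korean and English text, since the patterns depend only on the format of the identifier and not on the surrounding language.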

3. Quasi-identifier Removal 

Since Korean regulations strictly prohibit the use of quasi-identifiers 
for research purposes, we adopted k-anonymity 
[16,17]. k-anonymity prevents the identification of a patient 
when there are fewer than k similar records. Though there is no 
standard for deciding k, El Emam [16] and El Emam et al. 
[17] proposed 5-anonymity for health records as a rule of 
thumb. This means that anonymity is considered guaranteed only when 
a result contains data from at least 5 patients. Based on this rule of thumb, 
we simply do not provide results that contain data from fewer than 5 
patients, which guarantees 5-anonymity in a simple way. 

Figure 5. User interface of search tool. In the left panel, the user can choose search criteria such as medication, order, lab results, and diagnosis. The user can set the detailed search parameters in the upper right panel. In this figure, the user searched the total number of outpatients on October 9, 2012. The lower right panel shows the search results. 
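
The suppression rule for 5-anonymity described in this section can be sketched as follows. This is a minimal Python illustration, assuming that query results arrive as a dictionary of group counts; the function name and the use of `None` to mark a withheld result are assumptions, not details from the AMC system.

```python
K = 5  # rule-of-thumb threshold for health records [16,17]


def suppress_small_counts(counts, k=K):
    """Withhold any group count below k; None marks a suppressed result."""
    return {group: (n if n >= k else None) for group, n in counts.items()}


if __name__ == "__main__":
    counts_by_sex = {"male": 12, "female": 3}
    print(suppress_small_counts(counts_by_sex))
```

Here the count for "female" is withheld because it falls below the threshold, while the count for "male" is returned unchanged.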

4. Search Tool 

A screenshot of the AMC anonymized search tool is shown 
in Figure 5. Users can find the number of patients by setting 
several phenotypes including a diagnosis name/code, an 
operation name/code, lab results, or medication. A user se- 
lects search criteria in the left panel by double-click or drag- 
and-drop, and sets the detailed conditions of the selected 
criteria in the upper right panel. Finally, the search results 
are displayed in the lower panel with graphs. When deciding 
the detailed conditions, a researcher can use diverse operators, 
such as 'equal', 'bigger than', 'between', and others as necessary. 
The left graph in the result panel shows the number of 
patients categorized by sex, and the right one presents the 
number of patients by age group. In Korea, ethnic group is 
not an important category. 
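
How search criteria with operators such as 'equal', 'bigger than', and 'between' translate into a patient count can be sketched in Python as below. This is only an illustration of the idea: the in-memory patient list, the field names, and the `count_patients` and `count_by_sex` helpers are hypothetical, whereas the actual tool queries the data warehouse.

```python
from collections import Counter

# Operator names follow the examples in the text; each takes (value, argument).
OPERATORS = {
    "equal": lambda value, arg: value == arg,
    "bigger than": lambda value, arg: value > arg,
    "between": lambda value, arg: arg[0] <= value <= arg[1],
}


def matches(patient, criteria):
    """True if the patient satisfies every (field, operator, argument) criterion."""
    return all(
        field in patient and OPERATORS[op](patient[field], arg)
        for field, op, arg in criteria
    )


def count_patients(patients, criteria):
    """Number of patients satisfying all criteria."""
    return sum(1 for p in patients if matches(p, criteria))


def count_by_sex(patients, criteria):
    """Matching patients grouped by sex, as in the left graph of the result panel."""
    return dict(Counter(p["sex"] for p in patients if matches(p, criteria)))


if __name__ == "__main__":
    patients = [
        {"sex": "F", "age": 34, "diagnosis_code": "E11", "hba1c": 7.2},
        {"sex": "M", "age": 61, "diagnosis_code": "E11", "hba1c": 6.1},
        {"sex": "F", "age": 58, "diagnosis_code": "I10", "hba1c": 5.4},
    ]
    criteria = [("diagnosis_code", "equal", "E11"), ("hba1c", "bigger than", 6.5)]
    print(count_patients(patients, criteria))
```

In a deployed tool, the counts returned here would still pass through the 5-anonymity check before being shown to the researcher.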

5. De-identified Chart Review Tool 

Figure 6 depicts a user interface of the chart review tool. 




Figure 6A shows the diagnosis and medication tab which in- 
tegrates AMC local diagnosis codes, ICD-10, and the related 
codes which were input by physicians. This tab also provides 
all medication orders with related drug information. The lab 
result tab, as seen in Figure 6B, shows the individual laboratory 
results as well as the overall trend of the chosen test. 
Radiology and pathology reports can also be reviewed, as shown in Figure 
6C. The operative reports tab, as seen in Figure 6D, provides 
operation names, operative diagnosis names, and ICD-9-CM 
codes. The EMR tab displays textual information in progress 
notes, admission notes, and discharge summary as shown in 
Figure 4. 

Using the given research ID, the user can access the de- 
identified patient's chart including diagnosis, medication, lab 
results, radiology and pathology reports, operative reports, 
progress notes, admission notes, and discharge summary. 
Thus, researchers can review all medical records related to 
the chosen patient. 

IV. Discussion 

Figure 6. User interface of chart review tool. (A) Diagnosis & medication, (B) lab results, (C) radiology & pathology reports, and (D) operative reports. 

We implemented the prototype of a de-identification system 
designed to comply with Korean governmental regulations. 
The lessons we learned by implementing the system at the largest 
medical center in Korea can be summarized as follows. 
First, we reconfirmed that a data warehouse is essential for the 
successful implementation of the de-identification system, as 
most of the previous de-identification systems are based on 
data warehouses. Without an EDW in AMC, it would be al- 
most impossible to implement the prototype system since the 
data in legacy systems, such as EMR and CPOE, cannot be de- 
identified. Second, a clinical research data warehouse is more 
suitable than the usual EDW. Some hospitals have imple- 
mented de-identification systems using EDW, for example, the 
Ohio State University Medical Center Information Warehouse 
[18]. However, there are many benefits of having a dedicated 
clinical research data warehouse for the de-identification sys- 
tem. Researchers usually require a large amount of raw data 
instead of summarized reports. Also, the de-identification 
system requires unique tools and processes, such as an honest broker, 
a de-identified ad-hoc query interface, an anonymized chart 
review, and an IRB approval interface, which are not necessary for 
an EDW. Third, an electronic IRB system (e-IRB) is needed, and 
it must have an interface for easy and automatic transferring 
of approval or waiver into the de-identification system. If there 
is only a paper-based IRB system or e-IRB without an inter- 
face to the de-identification system, the IRB approval must be 
checked manually. This is time consuming and inconvenient 
for researchers. Last, when extracting identifiable data with 
IRB approval, digital rights management software or a secure 
private cloud is necessary to prevent data breaches caused 
by hacking or carelessness of researchers. A data breach would make 
it meaningless to protect patients' privacy using the de-identification 
system. Also, a secure cloud system can offer powerful 
computing resources to handle big data such as genomic data. 